Goto

Collaborating Authors

 compression scheme




Collaborative Compressors in Distributed Mean Estimation with Limited Communication Budget

Vardhan, Harsh, Mazumdar, Arya

arXiv.org Machine Learning

Distributed high dimensional mean estimation is a common aggregation routine used often in distributed optimization methods. Most of these applications call for a communication-constrained setting where vectors, whose mean is to be estimated, have to be compressed before sharing. One could independently encode and decode these to achieve compression, but that overlooks the fact that these vectors are often close to each other. To exploit these similarities, recently Suresh et al., 2022, Jhunjhunwala et al., 2021, Jiang et al, 2023, proposed multiple correlation-aware compression schemes. However, in most cases, the correlations have to be known for these schemes to work. Moreover, a theoretical analysis of graceful degradation of these correlation-aware compression schemes with increasing dissimilarity is limited to only the $\ell_2$-error in the literature. In this paper, we propose four different collaborative compression schemes that agnostically exploit the similarities among vectors in a distributed setting. Our schemes are all simple to implement and computationally efficient, while resulting in big savings in communication. The analysis of our proposed schemes show how the $\ell_2$, $\ell_\infty$ and cosine estimation error varies with the degree of similarity among vectors.


Multiclass Boosting: Simple and Intuitive Weak Learning Criteria

Neural Information Processing Systems

We study a generalization of boosting to the multiclass setting. We introduce a weak learning condition for multiclass classification that captures the original notion of weak learnability as being "slightly better than random guessing".


Unlabelled Sample Compression Schemes for Intersection-Closed Classes and Extremal Classes

Neural Information Processing Systems

The sample compressibility of concept classes plays an important role in learning theory, as a sufficient condition for PAC learnability, and more recently as an avenue for robust generalisation in adaptive data analysis. Whether compression schemes of size $O(d)$ must necessarily exist for all classes of VC dimension $d$ is unknown, but conjectured to be true by Warmuth. Recently Chalopin, Chepoi, Moran, and Warmuth (2018) gave a beautiful unlabelled sample compression scheme of size VC dimension for all maximum classes: classes that meet the Sauer-Shelah-Perles Lemma with equality. They also offered a counterexample to compression schemes based on a promising approach known as corner peeling. In this paper we simplify and extend their proof technique to deal with so-called extremal classes of VC dimension $d$ which contain maximum classes of VC dimension $d-1$. A criterion is given which would imply that all extremal classes admit unlabelled compression schemes of size $d$. We also prove that all intersection-closed classes with VC dimension $d$ admit unlabelled compression schemes of size at most $11d$.


Impossibility Results for Grammar-Compressed Linear Algebra

Neural Information Processing Systems

To handle vast amounts of data, it is natural and popular to compress vectors and matrices. When we compress a vector from size N down to size n << N, it certainly makes it easier to store and transmit efficiently, but does it also make it easier to process? In this paper we consider lossless compression schemes, and ask if we can run our computations on the compressed data as efficiently as if the original data was that small. That is, if an operation has time complexity T(input-size), can we perform it on the compressed representation in time T(n) rather than T(N)? We consider the most basic linear algebra operations: inner product, matrix-vector multiplication, and matrix multiplication.



Eigenvalue Decay Implies Polynomial-Time Learnability for Neural Networks

Surbhi Goel, Adam Klivans

Neural Information Processing Systems

We consider the problem of learning function classes computed by neural networks with various activations (e.g. ReLU or Sigmoid), a task believed to be computationally intractable in the worst-case. A major open problem is to understand the minimal assumptions under which these classes admit provably efficient algorithms. In this work we show that a natural distributional assumption corresponding to eigenvalue decay of the Gram matrix yields polynomial-time algorithms in the non-realizable setting for expressive classes of networks (e.g.